DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees, and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business green-lights the project, it has tasked your data science team with conducting an analysis of existing employee data.
Here I analyze the dataset CaseStudy2-data.csv to identify factors that lead to attrition. I will identify the top three factors that contribute to turnover, backed by evidence from the analysis; derived attributes/variables/features may be created where needed. The business is also interested in any job-role-specific trends in the data set (e.g., “Data Scientists have the highest job satisfaction”), and I note other interesting trends and observations as well. The analysis is supported by robust experimentation and appropriate visualization, all conducted in R. I will also build a model to predict attrition.
library(tidyverse) #The "tidyverse" collects some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. The packages work in harmony to clean, process, model, and visualize data.
library(skimr) #compact, readable data summaries via skim()
library(mice) #package provides a nice function md.pattern() to get a better understanding of the pattern of missing data
library(VIM) #visual summaries of missing-data patterns via aggr()
library(naniar) #https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html (for gg_mis_var) (Missing values)
library(mlbench) #collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository. (also has BostonHousing)
library(caret)
library(mlr)
library(ggthemes)
library(gplots)
library(randomForest)
library(corrplot)
library(kableExtra)
library(plotly)
library(GGally) #for ggpairs
library(Boruta) #for Automated EDA later
library(readxl) #read excel
library(e1071) #Naive Bayes
rawdata0 <- read.csv("CaseStudy2-data.csv")
head(rawdata0)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours
## 1 3 3 80
## 2 3 1 80
## 3 3 3 80
## 4 3 3 80
## 5 3 3 80
## 6 4 3 80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1 1 8 3 2
## 2 0 21 2 4
## 3 0 10 2 3
## 4 2 14 3 3
## 5 0 6 2 3
## 6 2 9 4 2
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 5 2 0
## 2 20 7 4
## 3 2 2 2
## 4 14 10 5
## 5 6 3 1
## 6 9 7 1
## YearsWithCurrManager
## 1 3
## 2 9
## 3 2
## 4 7
## 5 3
## 6 7
view(rawdata0) #870 rows, 36 columns
length(rawdata0) #for a data frame, length() returns the number of columns; ncol(rawdata0) is equivalent
## [1] 36
skim(rawdata0) #overall data summary
| Name | rawdata0 |
| Number of rows | 870 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 27 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| Over18 | 0 | 1 | FALSE | 1 | Y: 870 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
rawdata <- rawdata0
The dataset has 870 observations and 36 variables.
skim() already shows there is no missing data in the dataset. For reference, I also demonstrate a few other ways to check for missing values; in practice, any one of them would suffice.
md.pattern(rawdata)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## 870 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Education EducationField EmployeeCount EmployeeNumber
## 870 1 1 1 1
## 0 0 0 0
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 870 1 1 1 1 1
## 0 0 0 0 0
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 870 1 1 1 1 1
## 0 0 0 0 0
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 870 1 1 1 1 1
## 0 0 0 0 0
## RelationshipSatisfaction StandardHours StockOptionLevel
## 870 1 1 1
## 0 0 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 870 1 1 1 1
## 0 0 0 0
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 870 1 1 1 0
## 0 0 0 0
aggr_plot <- aggr(rawdata, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(rawdata), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## ID 0
## Age 0
## Attrition 0
## BusinessTravel 0
## DailyRate 0
## Department 0
## DistanceFromHome 0
## Education 0
## EducationField 0
## EmployeeCount 0
## EmployeeNumber 0
## EnvironmentSatisfaction 0
## Gender 0
## HourlyRate 0
## JobInvolvement 0
## JobLevel 0
## JobRole 0
## JobSatisfaction 0
## MaritalStatus 0
## MonthlyIncome 0
## MonthlyRate 0
## NumCompaniesWorked 0
## Over18 0
## OverTime 0
## PercentSalaryHike 0
## PerformanceRating 0
## RelationshipSatisfaction 0
## StandardHours 0
## StockOptionLevel 0
## TotalWorkingYears 0
## TrainingTimesLastYear 0
## WorkLifeBalance 0
## YearsAtCompany 0
## YearsInCurrentRole 0
## YearsSinceLastPromotion 0
## YearsWithCurrManager 0
gg_miss_var(rawdata, show_pct = TRUE) + labs(title = "Percent missing of the data") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5), axis.title.y = element_text(angle = 0, vjust = 1))
The dataset therefore contains no missing data.
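As a lighter-weight alternative to the packages above, base R can confirm the same result. A minimal sketch, assuming only the rawdata data frame already loaded:

```r
# Base-R missing-data check: count NAs per column and overall.
na_counts <- colSums(is.na(rawdata))
na_counts[na_counts > 0]  # empty when no column has missing values
anyNA(rawdata)            # single logical check across the whole data frame
```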
skim() (or view()) also reveals three columns with no variation, which can be dropped without affecting the analysis: Over18 is Y for all 870 observations, EmployeeCount is 1 for all observations, and StandardHours is 80 for all observations. These constants are also unsurprising: 18 is a standard minimum working age, and 80 hours likely reflects a biweekly pay period (employees are often paid every two weeks). I therefore drop these three columns.
drop_columns <- which(apply(rawdata, 2, function(x) (length(unique(x)) == 1)))
cols <- names(drop_columns)
rawdata <- rawdata[,-drop_columns]
#Equivalently, we could drop the columns by name: rawdata <- select(rawdata, -c("Over18","EmployeeCount","StandardHours"))
skim(rawdata)
| Name | rawdata |
| Number of rows | 870 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 25 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
Running skim() again confirms that all three columns have been dropped.
I also drop the ID and EmployeeNumber columns. They merely identify individual employees and carry no information about salary or attrition, so they are not useful for this analysis. After dropping them, I run skim() once more to verify.
rawdata <- select(rawdata, -c("ID","EmployeeNumber"))
skim(rawdata)
| Name | rawdata |
| Number of rows | 870 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.5 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.0 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.0 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.0 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.0 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.0 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.0 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.0 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.0 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.0 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.0 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.0 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.0 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.0 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.0 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.0 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.0 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
The dataset now has 31 columns.
Next, I convert the ordinal survey variables below from numeric codes to factors.
factorcolumns <- c("JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "WorkLifeBalance")
rawdata[,factorcolumns] <- lapply(rawdata[,factorcolumns], as.factor)
data0 <- rawdata #data0 - dataset that I use for the analysis
skim(data0)
| Name | data0 |
| Number of rows | 870 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 13 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobInvolvement | 0 | 1 | FALSE | 4 | 3: 514, 2: 228, 4: 81, 1: 47 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| JobSatisfaction | 0 | 1 | FALSE | 4 | 4: 271, 3: 254, 1: 179, 2: 166 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
| PerformanceRating | 0 | 1 | FALSE | 2 | 3: 738, 4: 132 |
| RelationshipSatisfaction | 0 | 1 | FALSE | 4 | 4: 264, 3: 261, 1: 174, 2: 171 |
| WorkLifeBalance | 0 | 1 | FALSE | 4 | 3: 532, 2: 192, 4: 98, 1: 48 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.5 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.0 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.0 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.0 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.0 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.0 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.0 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.0 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.0 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.0 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.0 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.0 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.0 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.0 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
The dataset now has 13 factor columns and 18 numeric columns.
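These counts can be cross-checked without skim(); a quick sketch:

```r
# Tally column classes to confirm 13 factor and 18 numeric columns.
table(sapply(data0, function(x) class(x)[1]))
```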
In the next part, I will do Exploratory Data Analysis (EDA). First, I analyze the dataset variable by variable, working on a copy of the cleaned data:
data2 <- data0 #working copy of the cleaned dataset used throughout the EDA
t.Age <- t.test(Age~Attrition, data=data2)
t.Age
##
## Welch Two Sample t-test
##
## data: Age by Attrition
## t = 4.1509, df = 184.91, p-value = 5.05e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.902905 5.350324
## sample estimates:
## mean in group No mean in group Yes
## 37.41233 33.78571
t.MonthlyIncome <- t.test(MonthlyIncome~Attrition, data=data2)
t.MonthlyIncome
##
## Welch Two Sample t-test
##
## data: MonthlyIncome by Attrition
## t = 5.3249, df = 228.45, p-value = 2.412e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1220.382 2654.047
## sample estimates:
## mean in group No mean in group Yes
## 6702.000 4764.786
t.Education <- t.test(Education~Attrition, data=data2)
t.Education
##
## Welch Two Sample t-test
##
## data: Education by Attrition
## t = 1.4738, df = 197.94, p-value = 0.1421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04651063 0.32165740
## sample estimates:
## mean in group No mean in group Yes
## 2.923288 2.785714
t.DistanceFromHome <- t.test(DistanceFromHome~Attrition, data=data2)
t.DistanceFromHome
##
## Welch Two Sample t-test
##
## data: DistanceFromHome by Attrition
## t = -2.4218, df = 186.03, p-value = 0.01641
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.4992554 -0.3574961
## sample estimates:
## mean in group No mean in group Yes
## 9.028767 10.957143
t.EnvironmentSatisfaction <- t.test(EnvironmentSatisfaction~Attrition, data=data2)
t.EnvironmentSatisfaction
##
## Welch Two Sample t-test
##
## data: EnvironmentSatisfaction by Attrition
## t = 2.1362, df = 185.27, p-value = 0.03397
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01768369 0.44474293
## sample estimates:
## mean in group No mean in group Yes
## 2.738356 2.507143
t.JobLevel <- t.test(JobLevel~Attrition, data=data2)
t.JobLevel
##
## Welch Two Sample t-test
##
## data: JobLevel by Attrition
## t = 5.231, df = 211.76, p-value = 4.042e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2995698 0.6618784
## sample estimates:
## mean in group No mean in group Yes
## 2.116438 1.635714
t.NumCompaniesWorked <- t.test(NumCompaniesWorked ~Attrition, data=data2)
t.NumCompaniesWorked
##
## Welch Two Sample t-test
##
## data: NumCompaniesWorked by Attrition
## t = -1.6637, df = 183.57, p-value = 0.09788
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.91435932 0.07776441
## sample estimates:
## mean in group No mean in group Yes
## 2.660274 3.078571
t.PercentSalaryHike <- t.test(PercentSalaryHike ~Attrition, data=data2)
t.PercentSalaryHike
##
## Welch Two Sample t-test
##
## data: PercentSalaryHike by Attrition
## t = -0.42788, df = 187.22, p-value = 0.6692
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8596808 0.5532229
## sample estimates:
## mean in group No mean in group Yes
## 15.17534 15.32857
t.StockOptionLevel <- t.test(StockOptionLevel ~Attrition, data=data2)
t.StockOptionLevel
##
## Welch Two Sample t-test
##
## data: StockOptionLevel by Attrition
## t = 4.216, df = 187.92, p-value = 3.86e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1845700 0.5091678
## sample estimates:
## mean in group No mean in group Yes
## 0.8397260 0.4928571
t.TotalWorkingYears <- t.test(TotalWorkingYears ~Attrition, data=data2)
t.TotalWorkingYears
##
## Welch Two Sample t-test
##
## data: TotalWorkingYears by Attrition
## t = 5.1364, df = 201.19, p-value = 6.596e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.105259 4.728792
## sample estimates:
## mean in group No mean in group Yes
## 11.602740 8.185714
t.TrainingTimesLastYear <- t.test(TrainingTimesLastYear ~Attrition, data=data2)
t.TrainingTimesLastYear
##
## Welch Two Sample t-test
##
## data: TrainingTimesLastYear by Attrition
## t = 1.8954, df = 200.36, p-value = 0.05948
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.00876413 0.44301071
## sample estimates:
## mean in group No mean in group Yes
## 2.867123 2.650000
t.YearsAtCompany <- t.test(YearsAtCompany ~Attrition, data=data2)
t.YearsAtCompany
##
## Welch Two Sample t-test
##
## data: YearsAtCompany by Attrition
## t = 3.7256, df = 191.55, p-value = 0.0002563
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9922099 3.2248155
## sample estimates:
## mean in group No mean in group Yes
## 7.301370 5.192857
t.YearsInCurrentRole <- t.test(YearsInCurrentRole ~Attrition, data=data2)
t.YearsInCurrentRole
##
## Welch Two Sample t-test
##
## data: YearsInCurrentRole by Attrition
## t = 4.9513, df = 208, p-value = 1.522e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9306052 2.1619584
## sample estimates:
## mean in group No mean in group Yes
## 4.453425 2.907143
t.YearsSinceLastPromotion <- t.test(YearsSinceLastPromotion ~Attrition, data=data2)
t.YearsSinceLastPromotion
##
## Welch Two Sample t-test
##
## data: YearsSinceLastPromotion by Attrition
## t = 0.12796, df = 187.59, p-value = 0.8983
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5712911 0.6505475
## sample estimates:
## mean in group No mean in group Yes
## 2.175342 2.135714
t.YearsWithCurrManager <- t.test(YearsWithCurrManager ~Attrition, data=data2)
t.YearsWithCurrManager
##
## Welch Two Sample t-test
##
## data: YearsWithCurrManager by Attrition
## t = 4.6826, df = 209.75, p-value = 5.084e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8262439 2.0277679
## sample estimates:
## mean in group No mean in group Yes
## 4.369863 2.942857
testnames <- c("Age","Monthly Income", "Education",
"Distance From Home", "Environment Satisfaction",
"Job Level", "Number of Companies Worked",
"Percent Salary Hike", "Stock Option Level",
"Total Working Years", "Training Times Last Year",
"Years At Company", "Years In Current Role",
"Years Since Last Promotion", "Years With Current Manager"
)
testpval <- c(t.Age$p.value,
t.MonthlyIncome$p.value,
t.Education$p.value,
t.DistanceFromHome$p.value,
t.EnvironmentSatisfaction$p.value,
t.JobLevel$p.value,
t.NumCompaniesWorked$p.value,
t.PercentSalaryHike$p.value,
t.StockOptionLevel$p.value,
t.TotalWorkingYears$p.value,
t.TrainingTimesLastYear$p.value,
t.YearsAtCompany$p.value,
t.YearsInCurrentRole$p.value,
t.YearsSinceLastPromotion$p.value,
t.YearsWithCurrManager$p.value
)
ttestout <- cbind.data.frame(testnames,testpval)
ttestout$testpval <- round(ttestout$testpval,10)
names(ttestout) <- c("Variable","P-Value")
The p-values of the Welch two-sample t-tests of each numeric variable by Attrition:
ttestout %>% kable()
| Variable | P-Value |
|---|---|
| Age | 0.0000505 |
| Monthly Income | 0.0000002 |
| Education | 0.1421319 |
| Distance From Home | 0.0164052 |
| Environment Satisfaction | 0.0339727 |
| Job Level | 0.0000004 |
| Number of Companies Worked | 0.0978823 |
| Percent Salary Hike | 0.6692297 |
| Stock Option Level | 0.0000386 |
| Total Working Years | 0.0000007 |
| Training Times Last Year | 0.0594832 |
| Years At Company | 0.0002563 |
| Years In Current Role | 0.0000015 |
| Years Since Last Promotion | 0.8983165 |
| Years With Current Manager | 0.0000051 |
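Since 15 t-tests are run simultaneously, some small p-values can arise by chance alone. As a precaution, the table can be augmented with multiplicity-adjusted p-values; a sketch building on the ttestout data frame above:

```r
# Adjust for multiple comparisons; Benjamini-Hochberg controls the
# false discovery rate across the 15 simultaneous tests.
ttestout$`BH Adjusted` <- p.adjust(ttestout$`P-Value`, method = "BH")
ttestout[order(ttestout$`BH Adjusted`), ] %>% kable()
```

The highly significant variables remain so after adjustment, while borderline ones (e.g., Environment Satisfaction) land near the 0.05 threshold.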
This part provides more detail on Monthly Income and Attrition; the per-variable findings are combined in the final part.
The following correlation plot shows the pairwise correlations between the numeric variables.
correlator <- function(df){
df %>%
keep(is.numeric) %>%
tidyr::drop_na() %>%
cor %>%
corrplot("upper", addCoef.col = "white", number.digits = 2,
number.cex = 0.5, method="square",
order="hclust",
tl.srt=50, tl.cex = 0.5)
}
correlator(data2)
#tidyr::drop_na() could be omitted here, since this dataset has no missing data
#The same plot can be produced directly, without wrapping it in a function:
#data2 %>% keep(is.numeric) %>% cor %>% corrplot("upper", addCoef.col = "white", number.digits = 2, number.cex = 0.5, method="square", order="hclust", tl.srt=50, tl.cex = 0.5)
From the correlation plot, the variables most strongly correlated with MonthlyIncome are:
| Variable | Correlation with MonthlyIncome |
|---|---|
| JobLevel | 0.95 |
| TotalWorkingYears | 0.78 |
| YearsAtCompany | 0.49 |
| Age | 0.48 |
| YearsInCurrentRole | 0.36 |
| YearsWithCurrManager | 0.33 |
| YearsSinceLastPromotion | 0.32 |
These figures show that Monthly Income correlates most strongly with Job Level; Monthly Income, Total Working Years, and Job Level are all strongly inter-correlated, which is logical.
Similarly, the variables most strongly correlated with JobLevel are:
| Variable | Correlation with JobLevel |
|---|---|
| MonthlyIncome | 0.95 |
| TotalWorkingYears | 0.78 |
| YearsAtCompany | 0.52 |
| Age | 0.48 |
| YearsInCurrentRole | 0.39 |
| YearsWithCurrManager | 0.37 |
| YearsSinceLastPromotion | 0.33 |
The same pattern appears here: Job Level correlates most strongly with Monthly Income, and again Job Level, Monthly Income, and Total Working Years form a strongly inter-correlated group.
From the per-variable analysis (EDA1) and the correlation plot, we can see that several of the numeric variables related to Attrition are themselves strongly correlated with each other (MonthlyIncome ~ TotalWorkingYears, JobLevel ~ TotalWorkingYears, YearsAtCompany ~ YearsInCurrentRole, YearsWithCurrManager ~ YearsInCurrentRole, YearsWithCurrManager ~ YearsAtCompany).
Another analysis in the next part will help decide on a good model.
Next I use automated EDA for feature selection. The Boruta package identifies the variables important for predicting MonthlyIncome. (http://r-statistics.co/Variable-Selection-and-Importance-With-R.html)
boruta_output <- Boruta(MonthlyIncome ~ ., data=data2, doTrace=2)
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
print(boruta_signif) # significant variables
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "Department"
## [5] "Education" "JobLevel"
## [7] "JobRole" "NumCompaniesWorked"
## [9] "TotalWorkingYears" "YearsAtCompany"
## [11] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [13] "YearsWithCurrManager"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
Running this code yields 13 important variables: Age, Attrition, BusinessTravel, Department, Education, JobLevel, JobRole, NumCompaniesWorked, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.
Similarly, I run Boruta with Attrition as the response:
boruta_output <- Boruta(Attrition ~ ., data=rawdata0, doTrace=2)
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
print(boruta_signif) # significant variables
## [1] "Age" "Department" "JobInvolvement"
## [4] "JobLevel" "JobRole" "JobSatisfaction"
## [7] "MaritalStatus" "MonthlyIncome" "NumCompaniesWorked"
## [10] "OverTime" "StockOptionLevel" "TotalWorkingYears"
## [13] "WorkLifeBalance" "YearsAtCompany" "YearsInCurrentRole"
## [16] "YearsWithCurrManager"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
Then I will use these important variables to build a model later: Age, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, NumCompaniesWorked, OverTime, StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager.
Here I will use the results of EDA4 to build a model for Monthly Income.
I will build a multiple linear regression of Monthly Income on the 13 important variables identified in the last part.
set.seed(100)
data3 <- select(rawdata, "MonthlyIncome", "Age", "Attrition", "BusinessTravel", "Department", "Education", "JobLevel", "JobRole", "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
splitPerc1 = 0.8
trainIndices1 = sample(1:dim(data3)[1],round(splitPerc1 * dim(data3)[1]))
train1 = data3[trainIndices1,]
test1 = data3[-trainIndices1,]
lm1 = lm(MonthlyIncome ~ ., data = train1)
summary(lm1)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3599.9 -618.2 -28.7 586.9 3976.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -460.4385 584.7780 -0.787 0.431341
## Age -1.6469 6.3150 -0.261 0.794330
## AttritionYes 104.0117 117.0640 0.889 0.374588
## BusinessTravelTravel_Frequently 191.2012 157.9819 1.210 0.226599
## BusinessTravelTravel_Rarely 409.2843 135.5348 3.020 0.002625 **
## DepartmentResearch & Development 228.2193 474.6301 0.481 0.630789
## DepartmentSales -351.1497 481.9957 -0.729 0.466541
## Education -30.3332 41.4089 -0.733 0.464101
## JobLevel 2791.0173 93.4593 29.863 < 2e-16 ***
## JobRoleHuman Resources -33.9349 558.5571 -0.061 0.951573
## JobRoleLaboratory Technician -676.1669 191.3353 -3.534 0.000437 ***
## JobRoleManager 4328.4088 305.8894 14.150 < 2e-16 ***
## JobRoleManufacturing Director 180.1699 192.5615 0.936 0.349790
## JobRoleResearch Director 4127.5414 247.1638 16.700 < 2e-16 ***
## JobRoleResearch Scientist -355.1583 192.6310 -1.844 0.065663 .
## JobRoleSales Executive 570.9446 391.4210 1.459 0.145129
## JobRoleSales Representative 95.3766 431.1777 0.221 0.825004
## NumCompaniesWorked -3.4078 18.8530 -0.181 0.856614
## TotalWorkingYears 49.3545 12.1666 4.057 5.56e-05 ***
## YearsAtCompany 1.9822 15.2147 0.130 0.896380
## YearsInCurrentRole -0.6587 18.5037 -0.036 0.971613
## YearsSinceLastPromotion 30.7322 17.2900 1.777 0.075945 .
## YearsWithCurrManager -45.0684 18.8133 -2.396 0.016867 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1066 on 673 degrees of freedom
## Multiple R-squared: 0.9492, Adjusted R-squared: 0.9475
## F-statistic: 571.4 on 22 and 673 DF, p-value: < 2.2e-16
pred1 = predict(lm1, test1)
RMSE = sqrt(mean((test1$MonthlyIncome - pred1)^2))
RMSE
## [1] 1031.816
So the RMSE on the test set is about $1,032.
Here I will use the file CaseStudy2CompSet No Salary.xlsx and create a new data frame, case2nosalary1, with the same variables as data3 from the last step (Build a Model), without the MonthlyIncome variable.
library(readxl) # read_excel() is not loaded by library(tidyverse)
case2nosalary <- read_excel("CaseStudy2CompSet No Salary.xlsx")
case2nosalary1 <- select(case2nosalary, "Age", "Attrition", "BusinessTravel", "Department", "Education", "JobLevel", "JobRole", "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
Now I will use the model built in the last step to predict the salary for the file CaseStudy2CompSet No Salary.xlsx. Here the regression is refit on the full data3 (not just the training split), so the final predictions use all available data.
lm2 = lm(MonthlyIncome ~ ., data = data3)
pred2 = predict(lm2, case2nosalary1)
MonthlyIncome = pred2
MonthlyIncome = as.data.frame(MonthlyIncome)
case2nosalary2 = cbind(case2nosalary, MonthlyIncome)
write.csv(case2nosalary2, "Case2PredictionsNguyenSalary.csv", row.names = FALSE)
case2nosalary3 <- read.csv("Case2PredictionsNguyenSalary.csv", header = T)
head(case2nosalary3)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 871 43 No Travel_Frequently 1422 Sales
## 2 872 33 No Travel_Rarely 461 Research & Development
## 3 873 55 Yes Travel_Rarely 267 Sales
## 4 874 36 No Non-Travel 1351 Research & Development
## 5 875 27 No Travel_Rarely 1302 Research & Development
## 6 876 39 Yes Travel_Rarely 895 Sales
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 2 4 Life Sciences 1 1849
## 2 13 1 Life Sciences 1 995
## 3 13 4 Marketing 1 1372
## 4 9 4 Life Sciences 1 1949
## 5 19 3 Other 1 1619
## 6 5 3 Technical Degree 1 42
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 1 Male 92 3 2
## 2 2 Female 53 3 1
## 3 1 Male 85 4 4
## 4 1 Male 66 4 1
## 5 4 Male 67 2 1
## 6 4 Male 56 3 2
## JobRole JobSatisfaction MaritalStatus MonthlyRate
## 1 Sales Executive 4 Married 19246
## 2 Research Scientist 4 Single 17241
## 3 Sales Executive 3 Single 9277
## 4 Laboratory Technician 2 Married 9238
## 5 Laboratory Technician 1 Divorced 16290
## 6 Sales Representative 4 Married 3335
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 1 Y No 20 4
## 2 3 Y No 18 3
## 3 6 Y Yes 17 3
## 4 1 Y No 22 4
## 5 1 Y No 11 3
## 6 3 Y No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 80 1
## 2 1 80 0
## 3 3 80 0
## 4 2 80 0
## 5 1 80 2
## 6 3 80 1
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 7 5 3 7
## 2 5 4 3 3
## 3 24 2 2 19
## 4 5 3 3 5
## 5 7 3 3 7
## 6 19 6 4 1
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 7 7 7
## 2 2 0 2
## 3 7 3 8
## 4 4 0 2
## 5 7 0 7
## 6 0 0 0
## MonthlyIncome
## 1 5665.948
## 2 2730.876
## 3 12189.751
## 4 1954.560
## 5 2356.053
## 6 6102.791
Later, I will upload the predicted file Case2PredictionsNguyenSalary.csv for Salary to GitHub.
Here I will use the results of EDA4 to build a model for Attrition. I will use these variables: Age, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, NumCompaniesWorked, OverTime, StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager.
I will build a Naive Bayes classifier here.
data4 <- select(rawdata, "Attrition", "Age", "Department", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "NumCompaniesWorked", "OverTime", "StockOptionLevel", "TotalWorkingYears", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager")
set.seed(10000)
splitPercNB1 = 0.8
trainIndicesNB1 = sample(1:dim(data4)[1],round(splitPercNB1 * dim(data4)[1]))
trainNB1 = data4[trainIndicesNB1,]
testNB1 = data4[-trainIndicesNB1,]
library(e1071) # provides naiveBayes()
NB1 <- naiveBayes(Attrition ~ ., data = trainNB1, laplace = 1) # laplace must be non-negative; 1 gives add-one smoothing
predNB1 = predict(NB1, testNB1)
table(as.factor(testNB1$Attrition),predNB1)
## predNB1
## No Yes
## No 125 26
## Yes 8 15
# Confusion matrix (confusionMatrix() comes from the caret package)
library(caret)
confusion.Matrix = confusionMatrix(predNB1, as.factor(testNB1$Attrition))
confusion.Matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 125 8
## Yes 26 15
##
## Accuracy : 0.8046
## 95% CI : (0.7378, 0.8607)
## No Information Rate : 0.8678
## P-Value [Acc > NIR] : 0.992836
##
## Kappa : 0.3604
##
## Mcnemar's Test P-Value : 0.003551
##
## Sensitivity : 0.8278
## Specificity : 0.6522
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.3659
## Prevalence : 0.8678
## Detection Rate : 0.7184
## Detection Prevalence : 0.7644
## Balanced Accuracy : 0.7400
##
## 'Positive' Class : No
##
Sensitivity = confusion.Matrix$byClass['Sensitivity']
Specificity = confusion.Matrix$byClass['Specificity']
Accuracy = confusion.Matrix$overall['Accuracy']
Accuracy
## Accuracy
## 0.8045977
Sensitivity
## Sensitivity
## 0.8278146
Specificity
## Specificity
## 0.6521739
Here I will use the file CaseStudy2CompSet No Attrition.csv and create a new data frame, case2noattrition1, with the same variables as data4 from the last step (Build a Model), without the Attrition variable.
case2noattrition <- read.csv("CaseStudy2CompSet No Attrition.csv")
case2noattrition1 <- select(case2noattrition, "Age", "Department", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "NumCompaniesWorked", "OverTime", "StockOptionLevel", "TotalWorkingYears", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager")
Now I will use the model built in the last step to predict Attrition for the file CaseStudy2CompSet No Attrition.csv.
set.seed(10000)
NB2 <- naiveBayes(Attrition ~ ., data = data4, laplace = 1) # refit on the full data set; laplace must be non-negative
predNB2 = predict(NB2, case2noattrition1)
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'JobInvolvement'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'JobSatisfaction'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'WorkLifeBalance'. Did you use
## factors with numeric labels for training, and numeric values for new data?
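These warnings indicate that a few predictors were stored as factors when the model was trained but arrive as plain numbers in the new file. A minimal sketch of one way to silence them, shown on toy values (in the analysis itself, the loop would run over c("JobInvolvement", "JobSatisfaction", "WorkLifeBalance") with data4 as the training frame and case2noattrition1 as the new data):

```r
# Align the types of the new data with the training data before predict():
# columns that were factors at training time must be factors with the same levels.
train_df <- data.frame(JobInvolvement = factor(c(1, 2, 3, 2)))
new_df   <- data.frame(JobInvolvement = c(2, 3))  # numeric, as read.csv() returns it
for (v in c("JobInvolvement")) {
  new_df[[v]] <- factor(new_df[[v]], levels = levels(train_df[[v]]))
}
levels(new_df$JobInvolvement)  # now matches the training levels
```

With the levels aligned, predict.naiveBayes() treats the new columns as categorical, exactly as during training, and the warnings disappear.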
Attrition = predNB2
Attrition = as.data.frame(Attrition)
case2noattrition2 = cbind(case2noattrition,Attrition)
write.csv(case2noattrition2,"Case2PredictionsNguyenAttrition.csv", row.names = FALSE)
case2noattrition3 <- read.csv("Case2PredictionsNguyenAttrition.csv", header = T)
head(case2noattrition3)
## ID Age BusinessTravel DailyRate Department
## 1 1171 35 Travel_Rarely 750 Research & Development
## 2 1172 33 Travel_Rarely 147 Human Resources
## 3 1173 26 Travel_Rarely 1330 Research & Development
## 4 1174 55 Travel_Rarely 1311 Research & Development
## 5 1175 29 Travel_Rarely 1246 Sales
## 6 1176 51 Travel_Frequently 1456 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 28 3 Life Sciences 1 1596
## 2 2 3 Human Resources 1 1207
## 3 21 3 Medical 1 1107
## 4 2 3 Life Sciences 1 505
## 5 19 3 Life Sciences 1 1497
## 6 1 4 Medical 1 145
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## 6 1 Female 30 2 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Laboratory Technician 3 Married 3407
## 2 Human Resources 3 Married 3600
## 3 Laboratory Technician 3 Divorced 2377
## 4 Manager 4 Single 16659
## 5 Sales Executive 3 Divorced 8620
## 6 Healthcare Representative 1 Single 7484
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 25348 1 Y No 17
## 2 8429 1 Y No 13
## 3 19373 1 Y No 20
## 4 23258 2 Y Yes 13
## 5 23757 1 Y No 14
## 6 25796 3 Y No 20
## PerformanceRating RelationshipSatisfaction StandardHours
## 1 3 4 80
## 2 3 4 80
## 3 4 3 80
## 4 3 3 80
## 5 3 3 80
## 6 4 3 80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1 2 10 3 2
## 2 1 5 2 3
## 3 1 1 0 2
## 4 0 30 2 3
## 5 2 10 3 3
## 6 0 23 1 2
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 10 9 6
## 2 5 4 1
## 3 1 1 0
## 4 5 4 1
## 5 10 7 0
## 6 13 12 12
## YearsWithCurrManager Attrition
## 1 8 Yes
## 2 4 Yes
## 3 0 Yes
## 4 2 Yes
## 5 4 Yes
## 6 8 Yes
Later, I will upload the predicted file Case2PredictionsNguyenAttrition.csv for Attrition to GitHub.